📑   $\mathfrak {\color{#228B22} {P6: Capstone \ Project.\ Sberbank \ Russian \ Housing \ Market}}$


Sberbank Russian Housing Market https://www.kaggle.com/c/sberbank-russian-housing-market

Model evaluation: quantifying the quality of predictions http://scikit-learn.org/stable/modules/model_evaluation.html

In [2]:
from IPython.core.display import HTML
hide_code = ''
HTML('''<script> code_show = true;

function code_display() {
    if (code_show) {
        $('div.input').each(function(id) {if (id == 0 || $(this).html().indexOf('hide_code') > -1) {$(this).hide();}
        });
        $('div.output_prompt').css('opacity', 0);
    } else { 
        $('div.input').each(function(id) {$(this).show();});
        $('div.output_prompt').css('opacity', 1);
    }
    code_show = !code_show;
}

$(document).ready(code_display);</script>
<form action="javascript: code_display()">
<input style="color: #228B22; background: ghostwhite; opacity: 0.9;"
type="submit" value="Click to display or hide code"></form>''')
In [194]:
hide_code
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import numpy as np
import pandas as pd
import scipy

import seaborn as sns
import matplotlib.pylab as plt

from random import random
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.model_selection import KFold, ParameterGrid, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, median_absolute_error, mean_absolute_error
from sklearn.metrics import r2_score, explained_variance_score
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.linear_model import Ridge, RidgeCV, BayesianRidge
from sklearn.linear_model import HuberRegressor, TheilSenRegressor, RANSACRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler, MinMaxScaler
from sklearn.pipeline import Pipeline

import keras as ks
from keras.models import Sequential, load_model, Model
from keras.optimizers import SGD, RMSprop
from keras.layers import Dense, Dropout, LSTM
from keras.layers import Activation, Flatten, Input, BatchNormalization
from keras.layers import Conv1D, MaxPooling1D, Conv2D, MaxPooling2D
from keras.layers.embeddings import Embedding
from keras.wrappers.scikit_learn import KerasRegressor
In [123]:
hide_code
def regression(regressor, x_train, x_test, y_train):
    reg = regressor
    reg.fit(x_train, y_train)
    
    y_train_reg = reg.predict(x_train)
    y_test_reg = reg.predict(x_test)
    
    return y_train_reg, y_test_reg

def loss_plot(fit_history):
    plt.figure(figsize=(18, 6))

    plt.plot(fit_history.history['loss'], color='#348ABD', label = 'train')
    plt.plot(fit_history.history['val_loss'], color='#FF7F50', label = 'test')

    plt.legend()
    plt.title('Loss Function');  
    
def mae_plot(fit_history):
    plt.figure(figsize=(18, 6))

    plt.plot(fit_history.history['mean_absolute_error'], color='#348ABD', label = 'train')
    plt.plot(fit_history.history['val_mean_absolute_error'], color='#FF7F50', label = 'test')

    plt.legend()
    plt.title('Mean Absolute Error');   

def scores(regressor, y_train, y_test, y_train_reg, y_test_reg):
    print("_______________________________________")
    print(regressor)
    print("_______________________________________")
    print("EV score. Train: ", explained_variance_score(y_train, y_train_reg))
    print("EV score. Test: ", explained_variance_score(y_test, y_test_reg))
    print("---------")
    print("R2 score. Train: ", r2_score(y_train, y_train_reg))
    print("R2 score. Test: ", r2_score(y_test, y_test_reg))
    print("---------")
    print("MSE score. Train: ", mean_squared_error(y_train, y_train_reg))
    print("MSE score. Test: ", mean_squared_error(y_test, y_test_reg))
    print("---------")
    print("MAE score. Train: ", mean_absolute_error(y_train, y_train_reg))
    print("MAE score. Test: ", mean_absolute_error(y_test, y_test_reg))
    print("---------")
    print("MdAE score. Train: ", median_absolute_error(y_train, y_train_reg))
    print("MdAE score. Test: ", median_absolute_error(y_test, y_test_reg))
    
def scores2(regressor, target, target_predict):
    print("_______________________________________")
    print(regressor)
    print("_______________________________________")
    print("EV score:", explained_variance_score(target, target_predict))
    print("---------")
    print("R2 score:", r2_score(target, target_predict))
    print("---------")
    print("MSE score:", mean_squared_error(target, target_predict))
    print("---------")
    print("MAE score:", mean_absolute_error(target, target_predict))
    print("---------")
    print("MdAE score:", median_absolute_error(target, target_predict))
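
For reference, the five scores printed by `scores` and `scores2` can be written out by hand. The sketch below is a plain-Python illustration of the standard definitions on made-up numbers; it is not the scikit-learn implementation, and the function name `regression_metrics` and the lists `y_true` and `y_pred` are hypothetical.

```python
from statistics import mean, median

def regression_metrics(y_true, y_pred):
    """Return EV, R2, MSE, MAE and MdAE for two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    mse = mean(e ** 2 for e in errors)
    # Population variance of the target (mean squared deviation from its mean)
    y_mean = mean(y_true)
    var_y = mean((t - y_mean) ** 2 for t in y_true)
    # Variance of the errors: unlike the MSE, it ignores a constant bias
    e_mean = mean(errors)
    var_e = mean((e - e_mean) ** 2 for e in errors)
    return {
        'EV': 1 - var_e / var_y,
        'R2': 1 - mse / var_y,
        'MSE': mse,
        'MAE': mean(abs_errors),
        'MdAE': median(abs_errors),
    }

# Hypothetical prices (millions of rubles); every prediction is too high by 1
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [4.0, 6.0, 8.0, 10.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the constant offset lowers R2 but leaves EV at 1: the two scores coincide only when the mean error is zero.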

$\mathfrak {\color{#228B22} {1. \ Capstone \ Proposal \ Overview }}$

In this capstone project proposal, written before the Capstone Project itself, we draw on what we have learned throughout the Nanodegree program to propose a solution to a problem of our choice using machine learning algorithms and techniques. A project proposal covers seven key points:

  • The project's domain background : the field of research from which the project is derived;
  • A problem statement : the problem being investigated, for which a solution will be defined;
  • The datasets and inputs : the data or inputs used for the problem;
  • A solution statement : the solution proposed for the given problem;
  • A benchmark model : a simple or historical model or result to compare the defined solution to;
  • A set of evaluation metrics : functional representations of how the solution can be measured;
  • An outline of the project design : how the solution will be developed and the results obtained.

$\mathfrak {\color{#228B22} {2. \ Domain \ Background }}$

Housing costs demand a significant investment from both consumers and developers. And when it comes to planning a budget, whether personal or corporate, the last thing anyone needs is uncertainty about one of their biggest expenses. Sberbank, Russia’s oldest and largest bank, helps its customers by making predictions about realty prices so that renters, developers, and lenders are more confident when they sign a lease or purchase a building.

Although the housing market is relatively stable in Russia, the country’s volatile economy makes forecasting prices as a function of apartment characteristics a unique challenge. Complex interactions between housing features such as the number of bedrooms and location are enough to make pricing predictions complicated. Adding an unstable economy to the mix means Sberbank and its customers need more than simple regression models in their arsenal.


$\mathfrak {\color{#228B22} {3. \ Problem \ Statement }}$

Sberbank is challenging programmers to develop algorithms that use a broad spectrum of features to predict realty prices. Competitors will rely on a rich dataset that includes housing data and macroeconomic patterns. An accurate forecasting model will allow Sberbank to provide more certainty to its customers in an uncertain economy.


$\mathfrak {\color{#228B22} {4. \ Datasets \ and \ Inputs }}$

4.1 Data Description (data_dictionary.txt)

In [3]:
hide_code
HTML('''<div id="data">
<p><iframe src="data_dictionary.txt" frameborder="0" height="300" width="97%"></iframe></p>
</div>''')

4.2 Load and Display the Data

In [169]:
hide_code
macro = pd.read_csv('macro.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
In [170]:
hide_code
macro[100:107].T[1:15]
Out[170]:
100 101 102 103 104 105 106
oil_urals 82.87 82.87 82.87 82.87 82.87 82.87 82.87
gdp_quart 9995.8 9995.8 9995.8 9995.8 9995.8 9995.8 9995.8
gdp_quart_growth 4.1 4.1 4.1 4.1 4.1 4.1 4.1
cpi 319.8 319.8 319.8 319.8 319.8 319.8 319.8
ppi 350.2 350.2 350.2 350.2 350.2 350.2 350.2
gdp_deflator NaN NaN NaN NaN NaN NaN NaN
balance_trade 16.604 16.604 16.604 16.604 16.604 16.604 16.604
balance_trade_growth 14.1 14.1 14.1 14.1 14.1 14.1 14.1
usdrub 29.1525 29.0261 29.1 28.9194 29.0239 29.092 29.092
eurrub 39.2564 39.4051 39.5008 39.5233 39.3691 39.2524 39.2524
brent 84.83 84.77 84.72 86.15 87.17 85.99 85.99
net_capital_export NaN NaN NaN NaN NaN NaN NaN
gdp_annual 38807.2 38807.2 38807.2 38807.2 38807.2 38807.2 38807.2
gdp_annual_growth -0.0782086 -0.0782086 -0.0782086 -0.0782086 -0.0782086 -0.0782086 -0.0782086
In [171]:
hide_code
train[200:207].T[1:15]
Out[171]:
200 201 202 203 204 205 206
timestamp 2011-10-25 2011-10-25 2011-10-25 2011-10-25 2011-10-26 2011-10-26 2011-10-26
full_sq 38 33 30 76 44 35 72
life_sq 19 14 18 51 29 21 45
floor 15 8 3 2 8 5 10
max_floor NaN NaN NaN NaN NaN NaN NaN
material NaN NaN NaN NaN NaN NaN NaN
build_year NaN NaN NaN NaN NaN NaN NaN
num_room NaN NaN NaN NaN NaN NaN NaN
kitch_sq NaN NaN NaN NaN NaN NaN NaN
state NaN NaN NaN NaN NaN NaN NaN
product_type Investment Investment Investment Investment Investment Investment Investment
sub_area Horoshevskoe Juzhnoe Butovo Marfino Juzhnoportovoe Vostochnoe Izmajlovo Lefortovo Krylatskoe
area_m 8.56843e+06 2.61551e+07 2.1044e+06 4.57959e+06 3.8e+06 8.99364e+06 1.21645e+07
raion_popul 56535 178264 26943 71715 76308 89971 78507
In [172]:
hide_code
test[100:107].T[1:15]
Out[172]:
100 101 102 103 104 105 106
timestamp 2015-07-08 2015-07-08 2015-07-08 2015-07-08 2015-07-08 2015-07-08 2015-07-09
full_sq 48.9 37.5 39.8 39.4 54.99 47.8 36.6
life_sq 31.2 18.8 18.9 25.4 NaN 46.1 21.3
floor 4 11 6 1 3 8 1
max_floor 14 17 14 5 3 12 9
material 5 1 1 2 1 1 1
build_year 1974 NaN NaN 1968 2015 1975 1978
num_room 2 1 2 2 2 2 1
kitch_sq 8.9 8.4 9.8 5.4 1 6.7 6.4
state 2 2 3 3 NaN 3 2
product_type Investment Investment Investment Investment OwnerOccupier Investment Investment
sub_area Vostochnoe Degunino Krylatskoe Nekrasovka Pechatniki Poselenie Novofedorovskoe Otradnoe Bibirevo
area_m 3.87544e+06 1.21645e+07 1.13917e+07 1.84458e+07 1.48702e+08 1.00531e+07 6.40758e+06
raion_popul 94564 78507 19940 83369 6161 175518 155572

$\mathfrak {\color{#228B22} {5. \ Solution \ Statement }}$

5.1 Selection of Features

In [173]:
hide_code
X_list_num = ['full_sq', 'num_room', 'floor', 'area_m', 
              'timestamp',
              'preschool_education_centers_raion', 'school_education_centers_raion', 
              'children_preschool', 'children_school',
              'shopping_centers_raion', 'healthcare_centers_raion', 
              'office_raion', 'sport_objects_raion',
              'public_transport_station_min_walk', 
              'railroad_station_walk_min', 'railroad_station_avto_km', 'bus_terminal_avto_km',
              'cafe_count_500',
              'kremlin_km', 'workplaces_km', 
              'ID_metro', 'metro_km_avto', 'metro_min_walk', 
              'public_healthcare_km', 'shopping_centers_km', 'big_market_km',
              'fitness_km', 'swim_pool_km', 'stadium_km', 'park_km',
              'kindergarten_km', 'school_km', 'preschool_km', 'university_km', 'additional_education_km',
              'theater_km', 'exhibition_km', 'museum_km', 
              'big_road1_km', 'big_road2_km',
              'detention_facility_km', 'cemetery_km', 'oil_chemistry_km', 'radiation_km',
              'raion_popul', 'work_all', 'young_all', 'ekder_all']

X_list_cat = ['sub_area', 'ecology', 'big_market_raion']

# Work on copies so that later column assignments do not touch train/test
features_train = train[X_list_num].copy()
features_test = test[X_list_num].copy()
target_train = train['price_doc']
In [174]:
hide_code
plt.style.use('seaborn-whitegrid')

f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 6))

sns.distplot(target_train, bins=200, color='#348ABD', ax=ax1)
ax1.set_xlabel("Prices")

sns.distplot(np.log(target_train), bins=200, color='#348ABD', ax=ax2)
ax2.set_xlabel("Logarithm of the variable Prices")

plt.suptitle('Sberbank Russian Housing Data');
In [175]:
hide_code
print ("Sberbank Russian Housing Dataset Statistics: \n")
print ("Number of houses = ", len(target_train))
print ("Number of features = ", len(list(features_train.keys())))
print ("Minimum house price = ", np.min(target_train))
print ("Maximum house price = ", np.max(target_train))
print ("Mean house price = ", "%.2f" % np.mean(target_train))
print ("Median house price = ", "%.2f" % np.median(target_train))
print ("Standard deviation of house prices =", "%.2f" % np.std(target_train))
Sberbank Russian Housing Dataset Statistics: 

Number of houses =  30471
Number of features =  48
Minimum house price =  100000
Maximum house price =  111111112
Mean house price =  7123035.28
Median house price =  6274411.00
Standard deviation of house prices = 4780032.89

5.2 Fill in Missing Values

In [176]:
hide_code
features_train.isnull().sum()
Out[176]:
full_sq                                 0
num_room                             9572
floor                                 167
area_m                                  0
timestamp                               0
preschool_education_centers_raion       0
school_education_centers_raion          0
children_preschool                      0
children_school                         0
shopping_centers_raion                  0
healthcare_centers_raion                0
office_raion                            0
sport_objects_raion                     0
public_transport_station_min_walk       0
railroad_station_walk_min              25
railroad_station_avto_km                0
bus_terminal_avto_km                    0
cafe_count_500                          0
kremlin_km                              0
workplaces_km                           0
ID_metro                                0
metro_km_avto                           0
metro_min_walk                         25
public_healthcare_km                    0
shopping_centers_km                     0
big_market_km                           0
fitness_km                              0
swim_pool_km                            0
stadium_km                              0
park_km                                 0
kindergarten_km                         0
school_km                               0
preschool_km                            0
university_km                           0
additional_education_km                 0
theater_km                              0
exhibition_km                           0
museum_km                               0
big_road1_km                            0
big_road2_km                            0
detention_facility_km                   0
cemetery_km                             0
oil_chemistry_km                        0
radiation_km                            0
raion_popul                             0
work_all                                0
young_all                               0
ekder_all                               0
dtype: int64
In [177]:
hide_code
features_test.isnull().sum()
Out[177]:
full_sq                               0
num_room                              0
floor                                 0
area_m                                0
timestamp                             0
preschool_education_centers_raion     0
school_education_centers_raion        0
children_preschool                    0
children_school                       0
shopping_centers_raion                0
healthcare_centers_raion              0
office_raion                          0
sport_objects_raion                   0
public_transport_station_min_walk     0
railroad_station_walk_min            34
railroad_station_avto_km              0
bus_terminal_avto_km                  0
cafe_count_500                        0
kremlin_km                            0
workplaces_km                         0
ID_metro                              0
metro_km_avto                         0
metro_min_walk                       34
public_healthcare_km                  0
shopping_centers_km                   0
big_market_km                         0
fitness_km                            0
swim_pool_km                          0
stadium_km                            0
park_km                               0
kindergarten_km                       0
school_km                             0
preschool_km                          0
university_km                         0
additional_education_km               0
theater_km                            0
exhibition_km                         0
museum_km                             0
big_road1_km                          0
big_road2_km                          0
detention_facility_km                 0
cemetery_km                           0
oil_chemistry_km                      0
radiation_km                          0
raion_popul                           0
work_all                              0
young_all                             0
ekder_all                             0
dtype: int64
In [178]:
hide_code
df = pd.DataFrame(features_train, columns=X_list_num)
df['prices'] = target_train

df = df.dropna(subset=['num_room'])

df['metro_min_walk'] = df['metro_min_walk'].interpolate(method='linear')
features_test['metro_min_walk'] = features_test['metro_min_walk'].interpolate(method='linear')

df['railroad_station_walk_min'] = df['railroad_station_walk_min'].interpolate(method='linear')
features_test['railroad_station_walk_min'] = \
features_test['railroad_station_walk_min'].interpolate(method='linear')

df['floor'] = df['floor'].fillna(df['floor'].median())
len(df)
Out[178]:
20899
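
The two imputation strategies used above behave quite differently, which a toy example may make clearer. This is a minimal sketch on hypothetical values (the Series `walk_min` and `floor` are made up, not taken from train.csv).

```python
import numpy as np
import pandas as pd

# Toy column with gaps (hypothetical values)
walk_min = pd.Series([10.0, np.nan, 30.0, np.nan, np.nan, 60.0])

# Linear interpolation fills each gap from the neighbouring rows,
# so the result depends on row order, not on the other features
filled_linear = walk_min.interpolate(method='linear')
print(filled_linear.tolist())

# Median imputation uses one value for every gap in the column
floor = pd.Series([2.0, np.nan, 5.0, 9.0])
filled_median = floor.fillna(floor.median())
print(filled_median.tolist())
```

Since interpolation borrows from adjacent rows, it implicitly assumes that neighbouring rows describe similar apartments; median imputation makes no such assumption but ignores all row-level information.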

5.3 Categorical and Macro Features

In [179]:
hide_code
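# Encode metro station IDs as consecutive integer codes (0, 1, 2, ...)
ID_metro_cat = pd.factorize(df['ID_metro'])
df['ID_metro'] = ID_metro_cat[0]
# Map each original ID to its code; codes follow the order of the unique IDs
ID_metro_pairs = dict(zip(ID_metro_cat[1], range(len(ID_metro_cat[1]))))
# ID 224 is assigned code 219 by hand (presumably it is absent from the training rows)
ID_metro_pairs[224] = 219
features_test['ID_metro'] = features_test['ID_metro'].replace(ID_metro_pairs)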

macro['salary'] = macro['salary'].interpolate(method='linear')
usdrub_pairs = dict(zip(list(macro['timestamp']), list(macro['usdrub'])))
salary_pairs = dict(zip(list(macro['timestamp']), list(macro['salary'])))

# Replace each sale date with the USD/RUB rate on that date, then rename the column
df['timestamp'] = df['timestamp'].replace(usdrub_pairs)
features_test['timestamp'] = features_test['timestamp'].replace(usdrub_pairs)
df.rename(columns={'timestamp' : 'usdrub'}, inplace=True)
features_test.rename(columns={'timestamp' : 'usdrub'}, inplace=True)
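
The encoding step above boils down to building a lookup table on the training data and applying it to the test data. The sketch below shows that pattern on hypothetical station IDs; the names `train_ids`, `test_ids`, `id_to_code` and `unseen_code` are illustrative, and giving unseen IDs one shared new code is an assumption of this example, not what the notebook does.

```python
import pandas as pd

# Hypothetical station IDs for a handful of training and test rows
train_ids = pd.Series([45, 12, 45, 7, 12])
test_ids = pd.Series([7, 45, 99])        # 99 never occurs in train_ids

# factorize assigns codes in order of first appearance
codes, uniques = pd.factorize(train_ids)
print(codes.tolist())
print(uniques.tolist())

# Original ID -> integer code, built from the order of `uniques`
id_to_code = dict(zip(uniques.tolist(), range(len(uniques))))

# Unseen IDs have no code; in this sketch they all get one new code
unseen_code = len(uniques)
test_codes = test_ids.map(id_to_code).fillna(unseen_code).astype(int)
print(test_codes.tolist())
```

The same dictionary-lookup idea is what turns each `timestamp` into the `usdrub` rate for that date.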

5.4 Display Correlation

In [180]:
hide_code
pearson = df.corr(method='pearson')
# The last row holds the correlations with 'prices'; drop the self-correlation
corr_with_prices = pearson.iloc[-1][:-1]
corr_with_prices[abs(corr_with_prices).argsort()[::-1]]
Out[180]:
full_sq                              0.593829
num_room                             0.476337
kremlin_km                          -0.290126
sport_objects_raion                  0.256412
ID_metro                             0.250502
stadium_km                          -0.238431
detention_facility_km               -0.233395
university_km                       -0.222964
theater_km                          -0.222873
workplaces_km                       -0.220889
swim_pool_km                        -0.220480
exhibition_km                       -0.212144
radiation_km                        -0.208256
museum_km                           -0.203846
park_km                             -0.201636
metro_min_walk                      -0.200058
fitness_km                          -0.197702
metro_km_avto                       -0.194751
school_education_centers_raion       0.193896
healthcare_centers_raion             0.185419
shopping_centers_km                 -0.182459
public_healthcare_km                -0.182388
big_road2_km                        -0.178865
bus_terminal_avto_km                -0.176601
ekder_all                            0.169331
area_m                              -0.167851
school_km                           -0.158775
preschool_education_centers_raion    0.157762
preschool_km                        -0.157079
office_raion                         0.149137
additional_education_km             -0.146074
raion_popul                          0.145984
kindergarten_km                     -0.141627
shopping_centers_raion               0.140370
work_all                             0.136761
railroad_station_walk_min           -0.135099
oil_chemistry_km                    -0.134873
children_school                      0.132915
railroad_station_avto_km            -0.132209
young_all                            0.131324
children_preschool                   0.129064
public_transport_station_min_walk   -0.128647
floor                                0.118989
cafe_count_500                       0.117084
big_road1_km                        -0.098968
usdrub                               0.069506
big_market_km                       -0.069257
cemetery_km                         -0.042413
Name: prices, dtype: float64
In [181]:
hide_code
features_list2 = corr_with_prices[abs(corr_with_prices).argsort()[::-1]][:32].index.values.tolist()
print(features_list2)
['full_sq', 'num_room', 'kremlin_km', 'sport_objects_raion', 'ID_metro', 'stadium_km', 'detention_facility_km', 'university_km', 'theater_km', 'workplaces_km', 'swim_pool_km', 'exhibition_km', 'radiation_km', 'museum_km', 'park_km', 'metro_min_walk', 'fitness_km', 'metro_km_avto', 'school_education_centers_raion', 'healthcare_centers_raion', 'shopping_centers_km', 'public_healthcare_km', 'big_road2_km', 'bus_terminal_avto_km', 'ekder_all', 'area_m', 'school_km', 'preschool_education_centers_raion', 'preschool_km', 'office_raion', 'additional_education_km', 'raion_popul']
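
Ranking features by the absolute value of their Pearson correlation with the target, as done above, can be sketched on a toy frame. The data and column names below are hypothetical and chosen only to make the ordering easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data: 'prices' rises with 'area' and falls with 'distance'
toy = pd.DataFrame({
    'area':     [30, 45, 60, 75, 90],
    'distance': [9, 8, 6, 7, 2],
    'noise':    [1, -1, 0, -1, 1],
    'prices':   [3, 4, 6, 8, 9],
})

# Correlation of every feature with the target, without the target itself
corr = toy.corr(method='pearson')['prices'].drop('prices')

# Sort by absolute value so strong negative correlations rank high too
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
top2 = ranked.index[:2].tolist()
print(ranked)
print(top2)
```

Pearson correlation only captures linear, one-feature-at-a-time relationships, so a feature that ranks low here can still be useful to a tree ensemble or a neural network.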

5.5 Scale, Shuffle and Split the Data

In [182]:
hide_code
target_train = df['prices']
features_train = df.drop('prices', axis=1)
target_train2 = target_train
features_train2 = features_train[features_list2]
features_test2 = features_test[features_list2]

# Convert to NumPy arrays (.values replaces the removed .as_matrix())
target_train = target_train.values
features_train = features_train.values
features_test = features_test.values
target_train2 = target_train2.values
features_train2 = features_train2.values
features_test2 = features_test2.values
In [184]:
hide_code
X_train, X_test, y_train, y_test = train_test_split(features_train, target_train, 
                                                    test_size = 0.2, random_state = 1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[184]:
((16719, 48), (4180, 48), (16719,), (4180,))
In [185]:
hide_code
X_train2, X_test2, y_train2, y_test2 = train_test_split(features_train2, target_train2, 
                                                        test_size = 0.2, random_state = 1)
X_train2.shape, X_test2.shape, y_train2.shape, y_test2.shape
Out[185]:
((16719, 32), (4180, 32), (16719,), (4180,))
In [188]:
hide_code
x_scale = RobustScaler()
X_train = x_scale.fit_transform(X_train)
X_test = x_scale.transform(X_test)

x_scale2 = RobustScaler()
X_train2 = x_scale2.fit_transform(X_train2)
X_test2 = x_scale2.transform(X_test2)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[188]:
((16719, 48), (4180, 48), (16719,), (4180,))
In [189]:
hide_code
y_scale = RobustScaler()
s_y_train = y_scale.fit_transform(y_train.reshape(-1,1))
s_y_test = y_scale.transform(y_test.reshape(-1,1))

y_scale2 = RobustScaler()
s_y_train2 = y_scale2.fit_transform(y_train2.reshape(-1,1))
s_y_test2 = y_scale2.transform(y_test2.reshape(-1,1))

s_y_train.shape, s_y_test.shape
Out[189]:
((16719, 1), (4180, 1))
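
What the target scaling does, and how to undo it, can be written out directly. This is a NumPy-only sketch that assumes `RobustScaler`'s default settings (centering on the median, scaling by the 25th to 75th percentile range); the price values and the constant prediction error are hypothetical.

```python
import numpy as np

# Hypothetical prices in rubles, including one extreme value
y = np.array([4e6, 5e6, 6e6, 7e6, 8e6, 100e6])

center = np.median(y)
q25, q75 = np.percentile(y, [25, 75])
scale = q75 - q25                     # interquartile range

# Forward transform, then back to rubles
y_scaled = (y - center) / scale
y_restored = y_scaled * scale + center

# Hypothetical predictions that are off by 0.1 in scaled units
pred_scaled = y_scaled + 0.1
pred = pred_scaled * scale + center

mse_scaled = np.mean((y_scaled - pred_scaled) ** 2)
mse_rubles = np.mean((y - pred) ** 2)
print(mse_scaled, mse_rubles)
```

An MSE measured on scaled targets differs from the MSE in rubles by a factor of `scale ** 2`, so error scores of models trained on `s_y_train` are only comparable with those of models trained on `y_train` after the predictions are mapped back, for example with `y_scale.inverse_transform`. R2 and EV are unaffected by this linear rescaling.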

$\mathfrak {\color{#228B22} {6. \ Benchmark \ Models }}$

6.1 Regressors; Scikit-Learn

Tuning Parameters

In [49]:
hide_code
param_grid_gbr = {'max_depth': [4, 5, 6], 'n_estimators': range(48, 481, 48)}
gridsearch_gbr = GridSearchCV(GradientBoostingRegressor(), 
                              param_grid_gbr, n_jobs=5).fit(X_train, y_train)

gridsearch_gbr.best_params_
Out[49]:
{'max_depth': 4, 'n_estimators': 240}
In [56]:
hide_code
param_grid_gbr2 = {'max_depth': [3, 4, 5], 'n_estimators': range(32, 321, 32)}
gridsearch_gbr2 = GridSearchCV(GradientBoostingRegressor(), 
                               param_grid_gbr2, n_jobs=5).fit(X_train2, y_train2)

gridsearch_gbr2.best_params_
Out[56]:
{'max_depth': 4, 'n_estimators': 288}
In [51]:
hide_code
param_grid_br = {'n_estimators': range(48, 481, 48)}
gridsearch_br = GridSearchCV(BaggingRegressor(), 
                             param_grid_br, n_jobs=5).fit(X_train, y_train)

gridsearch_br.best_params_
Out[51]:
{'n_estimators': 384}
In [60]:
hide_code
param_grid_br2 = {'n_estimators': range(32, 321, 32)}
gridsearch_br2 = GridSearchCV(BaggingRegressor(), 
                              param_grid_br2, n_jobs=5).fit(X_train2, y_train2)

gridsearch_br2.best_params_
Out[60]:
{'n_estimators': 128}
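
The cost of an exhaustive grid search follows from simple counting. The sketch below enumerates the first grid by hand with the standard library instead of scikit-learn; the fold count `n_folds` is an assumption to adjust, since the default of `GridSearchCV` depends on the scikit-learn version (3 folds before 0.22, 5 folds since).

```python
from itertools import product

param_grid = {'max_depth': [4, 5, 6], 'n_estimators': range(48, 481, 48)}

# Every combination of the listed values is one candidate model
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

n_folds = 3   # assumption: set this to the cv value actually used
n_fits = len(candidates) * n_folds
print(len(candidates), 'candidates,', n_fits, 'fits')
```

Each additional hyperparameter multiplies the number of candidates, which is why the grids above are kept to one or two parameters.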

Fit the Regressors

In [21]:
hide_code
y_train_gbr, y_test_gbr = regression(GradientBoostingRegressor(max_depth=4, n_estimators=240), 
                                     X_train, X_test, y_train)

y_train_br, y_test_br = regression(BaggingRegressor(n_estimators=384), 
                                   X_train, X_test, y_train)
In [22]:
hide_code
print('48 features')
scores('GradientBoostingRegressor', y_train, y_test, y_train_gbr, y_test_gbr)
scores('BaggingRegressor', y_train, y_test, y_train_br, y_test_br)
48 features
_______________________________________
GradientBoostingRegressor
_______________________________________
EV score. Train:  0.84429421648
EV score. Test:  0.717222751104
---------
R2 score. Train:  0.84429421648
R2 score. Test:  0.717164021221
---------
MSE score. Train:  3.66979708455e+12
MSE score. Test:  7.33442181165e+12
---------
MAE score. Train:  1186631.54657
MAE score. Test:  1436649.28701
---------
MdAE score. Train:  641967.791817
MdAE score. Test:  717542.327713
_______________________________________
BaggingRegressor
_______________________________________
EV score. Train:  0.956966974708
EV score. Test:  0.71736365465
---------
R2 score. Train:  0.956939330205
R2 score. Test:  0.716921331922
---------
MSE score. Train:  1.01488793094e+12
MSE score. Test:  7.34071516123e+12
---------
MAE score. Train:  522721.235406
MAE score. Test:  1402249.88094
---------
MdAE score. Train:  227142.388021
MdAE score. Test:  627311.283854
In [23]:
hide_code
y_train_gbr2, y_test_gbr2 = regression(GradientBoostingRegressor(max_depth=4, n_estimators=288), 
                                       X_train2, X_test2, y_train2)

y_train_br2, y_test_br2 = regression(BaggingRegressor(n_estimators=128), 
                                     X_train2, X_test2, y_train2)
In [24]:
hide_code
print('32 features')
scores('GradientBoostingRegressor', y_train2, y_test2, y_train_gbr2, y_test_gbr2)
scores('BaggingRegressor', y_train2, y_test2, y_train_br2, y_test_br2)
32 features
_______________________________________
GradientBoostingRegressor
_______________________________________
EV score. Train:  0.843076625974
EV score. Test:  0.712536399462
---------
R2 score. Train:  0.843076625974
R2 score. Test:  0.712472530424
---------
MSE score. Train:  3.69849422082e+12
MSE score. Test:  7.45608021092e+12
---------
MAE score. Train:  1204493.91739
MAE score. Test:  1472391.76539
---------
MdAE score. Train:  659823.287316
MdAE score. Test:  757552.315071
_______________________________________
BaggingRegressor
_______________________________________
EV score. Train:  0.947997124639
EV score. Test:  0.706809940732
---------
R2 score. Train:  0.947969628448
R2 score. Test:  0.706486514343
---------
MSE score. Train:  1.22629295786e+12
MSE score. Test:  7.61130787007e+12
---------
MAE score. Train:  609307.513809
MAE score. Test:  1457041.64576
---------
MdAE score. Train:  305178.634536
MdAE score. Test:  684045.511812

MLP Regressors

In [25]:
hide_code
mlpr = MLPRegressor(hidden_layer_sizes=(240,), max_iter=200, solver='lbfgs', 
                    alpha=0.01, verbose=2)
mlpr.fit(X_train, y_train)

y_train_mlpr = mlpr.predict(X_train)
y_test_mlpr = mlpr.predict(X_test)

scores('MLP Regressor #1', y_train, y_test, y_train_mlpr, y_test_mlpr)
_______________________________________
MLP Regressor #1
_______________________________________
EV score. Train:  0.66104496028
EV score. Test:  0.661748930506
---------
R2 score. Train:  0.661004705905
R2 score. Test:  0.661403014417
---------
MSE score. Train:  7.98970926978e+12
MSE score. Test:  8.78040031238e+12
---------
MAE score. Train:  1603903.13551
MAE score. Test:  1659204.29535
---------
MdAE score. Train:  885760.038664
MdAE score. Test:  915148.376151
In [26]:
hide_code
mlpr2 = MLPRegressor(hidden_layer_sizes=(288,), max_iter=300, solver='lbfgs', 
                    alpha=0.01, verbose=2)
mlpr2.fit(X_train2, y_train2)

y_train_mlpr2 = mlpr2.predict(X_train2)
y_test_mlpr2 = mlpr2.predict(X_test2)

scores('MLP Regressor #2', y_train2, y_test2, y_train_mlpr2, y_test_mlpr2)
_______________________________________
MLP Regressor #2
_______________________________________
EV score. Train:  0.693235421025
EV score. Test:  0.686745167826
---------
R2 score. Train:  0.693204469782
R2 score. Test:  0.686666854287
---------
MSE score. Train:  7.23079976155e+12
MSE score. Test:  8.12526563329e+12
---------
MAE score. Train:  1538378.67273
MAE score. Test:  1599805.85728
---------
MdAE score. Train:  814255.86223
MdAE score. Test:  856117.094979

Display Predictions

In [27]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(y_test[1:50], color = 'black', label='Real Data')

plt.plot(y_test_gbr[1:50], label='Gradient Boosting')
plt.plot(y_test_br[1:50], label='Bagging Regressor')
plt.plot(y_test_mlpr[1:50], label='MLP Regressor')

plt.legend()
plt.title("48 Features; Regressor Predictions vs Real Data");
In [28]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(y_test2[1:50], color = 'black', label='Real Data')

plt.plot(y_test_gbr2[1:50], label='Gradient Boosting')
plt.plot(y_test_br2[1:50], label='Bagging Regressor')
plt.plot(y_test_mlpr2[1:50], label='MLP Regressor')

plt.legend()
plt.title("32 Features; Regressor Predictions vs Real Data"); 

6.2 Neural Networks; Keras

MLP

In [301]:
hide_code
def mlp_model():
    model = Sequential()
    
    model.add(Dense(48, activation='relu', input_dim=48))
    model.add(Dense(48, activation='relu'))
    
    model.add(Dropout(0.1))
    
    model.add(Dense(192, activation='relu'))
    model.add(Dense(192, activation='relu'))
    
    model.add(Dropout(0.1))
    
    model.add(Dense(768, activation='relu'))
    model.add(Dense(768, activation='relu'))
    
    model.add(Dense(1))
    
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

mlp_model = mlp_model()

mlp_history = mlp_model.fit(X_train, s_y_train, validation_data=(X_test, s_y_test),
                            epochs=20, batch_size=16, verbose=0)
In [302]:
hide_code
loss_plot(mlp_history)
mae_plot(mlp_history)
In [303]:
hide_code
s_y_train_mlp = mlp_model.predict(X_train)
s_y_test_mlp = mlp_model.predict(X_test)

scores('MLP Model #1', s_y_train, s_y_test, s_y_train_mlp, s_y_test_mlp)
_______________________________________
MLP Model #1
_______________________________________
EV score. Train:  0.651685250865
EV score. Test:  0.690595786735
---------
R2 score. Train:  0.651684614715
R2 score. Test:  0.690438974259
---------
MSE score. Train:  0.633439240126
MSE score. Test:  0.619401859233
---------
MAE score. Train:  0.407406387265
MAE score. Test:  0.424039399589
---------
MdAE score. Train:  0.201703382583
MdAE score. Test:  0.214629895455
In [304]:
hide_code
mlp_model.save('mlp_model_p6.h5')
In [305]:
hide_code
def mlp_model2():
    model = Sequential()
    
    model.add(Dense(32, activation='relu', input_dim=32))
    model.add(Dense(32, activation='relu'))
    
    model.add(Dropout(0.1))
    
    model.add(Dense(128, activation='relu'))
    model.add(Dense(128, activation='relu'))
    
    model.add(Dropout(0.1))
    
    model.add(Dense(512, activation='relu'))
    model.add(Dense(512, activation='relu'))
    
    model.add(Dense(1))
    
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

mlp_model2 = mlp_model2()

mlp_history2 = mlp_model2.fit(X_train2, s_y_train2, validation_data=(X_test2, s_y_test2),
                              epochs=40, batch_size=16, verbose=0)
In [306]:
hide_code
loss_plot(mlp_history2)
mae_plot(mlp_history2)
In [307]:
hide_code
s_y_train_mlp2 = mlp_model2.predict(X_train2)
s_y_test_mlp2 = mlp_model2.predict(X_test2)

scores('MLP Model #2', s_y_train2, s_y_test2, s_y_train_mlp2, s_y_test_mlp2)
_______________________________________
MLP Model #2
_______________________________________
EV score. Train:  0.647044083991
EV score. Test:  0.664392864849
---------
R2 score. Train:  0.63344719128
R2 score. Test:  0.65446314789
---------
MSE score. Train:  0.666605445613
MSE score. Test:  0.691386030003
---------
MAE score. Train:  0.444339313087
MAE score. Test:  0.459656948743
---------
MdAE score. Train:  0.223214097559
MdAE score. Test:  0.238526810277
In [308]:
hide_code
mlp_model2.save('mlp_model2_p6.h5')

CNN

In [309]:
hide_code
def cnn_model():
    model = Sequential()
        
    model.add(Conv1D(48, 5, padding='valid', activation='relu', input_shape=(48, 1)))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(0.25))

    model.add(Conv1D(192, 3, padding='valid', activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(0.25))
    
    model.add(Flatten())

    model.add(Dense(768, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))

    model.add(Dense(1, kernel_initializer='normal'))
    
#    opt = keras.optimizers.rmsprop(decay=1e-6)
    
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

cnn_model = cnn_model()
cnn_history = cnn_model.fit(X_train.reshape(16719, 48, 1), s_y_train, 
                            epochs=25, batch_size=64, verbose=0,
                            validation_data=(X_test.reshape(4180, 48, 1), s_y_test))
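
The CNN and RNN cells reshape the tabular features with hard-coded row counts (16719 and 4180 here). A minimal sketch of the same reshaping with `-1`, so that numpy infers the number of rows; the array sizes and the names `X`, `X_cnn` and `X_rnn` are hypothetical and only serve the illustration:

```python
import numpy as np

# Hypothetical stand-in for a feature matrix: 6 rows, 48 features
n_samples, n_features = 6, 48
X = np.arange(n_samples * n_features, dtype='float32').reshape(n_samples, n_features)

# Conv1D input layout: (samples, steps, channels), each feature is one step
X_cnn = X.reshape(-1, n_features, 1)

# LSTM input layout: (samples, timesteps, features), a single timestep per row
X_rnn = X.reshape(-1, 1, n_features)

print(X_cnn.shape, X_rnn.shape)
```

With `-1` the same line should work for the train split, the test split and the competition data, whatever their number of rows.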
In [310]:
hide_code
loss_plot(cnn_history)
mae_plot(cnn_history)
In [311]:
hide_code
s_y_train_cnn = cnn_model.predict(X_train.reshape(16719, 48, 1))
s_y_test_cnn = cnn_model.predict(X_test.reshape(4180, 48, 1))

scores('CNN Model #1', s_y_train, s_y_test, s_y_train_cnn, s_y_test_cnn)
_______________________________________
CNN Model #1
_______________________________________
EV score. Train:  0.718231732205
EV score. Test:  0.679698344692
---------
R2 score. Train:  0.714944413504
R2 score. Test:  0.675778174695
---------
MSE score. Train:  0.518396263076
MSE score. Test:  0.648736710047
---------
MAE score. Train:  0.438653405757
MAE score. Test:  0.460115877486
---------
MdAE score. Train:  0.273668864833
MdAE score. Test:  0.277287479374
In [312]:
hide_code
cnn_model.save('cnn_model_p6.h5')
In [313]:
hide_code
def cnn_model2():
    model = Sequential()
        
    model.add(Conv1D(32, 5, padding='valid', activation='relu', input_shape=(32, 1)))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(0.25))

    model.add(Conv1D(128, 3, padding='valid', activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(0.25))
    
    model.add(Flatten())

    model.add(Dense(512, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(0.5))

    model.add(Dense(1, kernel_initializer='normal'))
    
#    opt = keras.optimizers.rmsprop(decay=1e-6)
    
    model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
    return model

cnn_model2 = cnn_model2()
cnn_history2 = cnn_model2.fit(X_train2.reshape(16719, 32, 1), s_y_train2, 
                              epochs=30, batch_size=16, verbose=0,
                              validation_data=(X_test2.reshape(4180, 32, 1), s_y_test2))
In [314]:
hide_code
loss_plot(cnn_history2)
mae_plot(cnn_history2)
In [315]:
hide_code
s_y_train_cnn2 = cnn_model2.predict(X_train2.reshape(16719, 32, 1))
s_y_test_cnn2 = cnn_model2.predict(X_test2.reshape(4180, 32, 1))

scores('CNN Model #2', s_y_train2, s_y_test2, s_y_train_cnn2, s_y_test_cnn2)
_______________________________________
CNN Model #2
_______________________________________
EV score. Train:  0.683510035276
EV score. Test:  0.688212327849
---------
R2 score. Train:  0.682683012642
R2 score. Test:  0.687531815381
---------
MSE score. Train:  0.577066187263
MSE score. Test:  0.625218804728
---------
MAE score. Train:  0.417907165479
MAE score. Test:  0.441350834867
---------
MdAE score. Train:  0.213669048945
MdAE score. Test:  0.222344950893
In [316]:
hide_code
cnn_model2.save('cnn_model2_p6.h5')

RNN

In [317]:
hide_code
def rnn_model():
    model = Sequential()
    
    model.add(LSTM(192, return_sequences=True, input_shape=(1, 48)))
    model.add(LSTM(768, return_sequences=False))   
    
    model.add(Dense(1))

    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])     
    return model 

rnn_model = rnn_model()
rnn_history = rnn_model.fit(X_train.reshape(16719, 1, 48), s_y_train.reshape(16719), 
                            epochs=8, verbose=0, 
                            validation_data=(X_test.reshape(4180, 1, 48), s_y_test.reshape(4180)))
In [318]:
hide_code
loss_plot(rnn_history)
mae_plot(rnn_history)
In [319]:
hide_code
s_y_train_rnn = rnn_model.predict(X_train.reshape(16719, 1, 48))
s_y_test_rnn = rnn_model.predict(X_test.reshape(4180, 1, 48))

scores('RNN Model #1', s_y_train, s_y_test, s_y_train_rnn, s_y_test_rnn)
_______________________________________
RNN Model #1
_______________________________________
EV score. Train:  0.694180994458
EV score. Test:  0.669849828845
---------
R2 score. Train:  0.69410434945
R2 score. Test:  0.669505552794
---------
MSE score. Train:  0.556295577596
MSE score. Test:  0.661287623581
---------
MAE score. Train:  0.414027009292
MAE score. Test:  0.432888857164
---------
MdAE score. Train:  0.204195214113
MdAE score. Test:  0.213506053885
In [320]:
hide_code
rnn_model.save('rnn_model_p6.h5')
In [321]:
hide_code
def rnn_model2():
    model = Sequential()
    
    model.add(LSTM(128, return_sequences=True, input_shape=(1, 32)))
    model.add(LSTM(512, return_sequences=False))   
    
    model.add(Dense(1))

    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])     
    return model 

rnn_model2 = rnn_model2()
rnn_history2 = rnn_model2.fit(X_train2.reshape(16719, 1, 32), s_y_train2, 
                              epochs=8, verbose=2, 
                              validation_data=(X_test2.reshape(4180, 1, 32), s_y_test2))
Train on 16719 samples, validate on 4180 samples
Epoch 1/8
82s - loss: 0.8268 - mean_absolute_error: 0.5059 - val_loss: 0.6910 - val_mean_absolute_error: 0.4711
Epoch 2/8
35s - loss: 0.6929 - mean_absolute_error: 0.4680 - val_loss: 0.7216 - val_mean_absolute_error: 0.4617
Epoch 3/8
34s - loss: 0.6756 - mean_absolute_error: 0.4598 - val_loss: 0.7252 - val_mean_absolute_error: 0.4861
Epoch 4/8
35s - loss: 0.6553 - mean_absolute_error: 0.4549 - val_loss: 0.6434 - val_mean_absolute_error: 0.4480
Epoch 5/8
34s - loss: 0.6372 - mean_absolute_error: 0.4491 - val_loss: 0.6527 - val_mean_absolute_error: 0.4528
Epoch 6/8
36s - loss: 0.6292 - mean_absolute_error: 0.4452 - val_loss: 0.6665 - val_mean_absolute_error: 0.4505
Epoch 7/8
34s - loss: 0.6236 - mean_absolute_error: 0.4428 - val_loss: 0.6256 - val_mean_absolute_error: 0.4527
Epoch 8/8
36s - loss: 0.6042 - mean_absolute_error: 0.4388 - val_loss: 0.6481 - val_mean_absolute_error: 0.4569
In [322]:
hide_code
loss_plot(rnn_history2)
mae_plot(rnn_history2)
In [323]:
hide_code
s_y_train_rnn2 = rnn_model2.predict(X_train2.reshape(16719, 1, 32))
s_y_test_rnn2 = rnn_model2.predict(X_test2.reshape(4180, 1, 32))

scores('RNN Model #2', s_y_train2, s_y_test2, s_y_train_rnn2, s_y_test_rnn2)
_______________________________________
RNN Model #2
_______________________________________
EV score. Train:  0.685670857225
EV score. Test:  0.679597402543
---------
R2 score. Train:  0.683087203938
R2 score. Test:  0.676119583212
---------
MSE score. Train:  0.576331133233
MSE score. Test:  0.648053584419
---------
MAE score. Train:  0.443593305802
MAE score. Test:  0.456857772297
---------
MdAE score. Train:  0.247995070438
MdAE score. Test:  0.255449966322
In [324]:
hide_code
rnn_model2.save('rnn_model2_p6.h5')

Display Predictions

In [325]:
hide_code
y_train_mlp = y_scale.inverse_transform(s_y_train_mlp)
y_test_mlp = y_scale.inverse_transform(s_y_test_mlp)

y_train_cnn = y_scale.inverse_transform(s_y_train_cnn)
y_test_cnn = y_scale.inverse_transform(s_y_test_cnn)

y_train_rnn = y_scale.inverse_transform(s_y_train_rnn)
y_test_rnn = y_scale.inverse_transform(s_y_test_rnn)
# Models trained on 32 features use their own target scaler (y_scale2)
y_train_mlp2 = y_scale2.inverse_transform(s_y_train_mlp2)
y_test_mlp2 = y_scale2.inverse_transform(s_y_test_mlp2)

y_train_cnn2 = y_scale2.inverse_transform(s_y_train_cnn2)
y_test_cnn2 = y_scale2.inverse_transform(s_y_test_cnn2)

y_train_rnn2 = y_scale2.inverse_transform(s_y_train_rnn2)
y_test_rnn2 = y_scale2.inverse_transform(s_y_test_rnn2)
In [326]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(y_test[1:50], color = 'black', label='Real Data')

plt.plot(y_test_mlp[1:50], label='MLP')
plt.plot(y_test_cnn[1:50], label='CNN')
plt.plot(y_test_rnn[1:50], label='RNN')

plt.legend()
plt.title("48 Features; Neural Network Predictions vs Real Data");
In [327]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(y_test2[1:50], color = 'black', label='Real Data')

plt.plot(y_test_mlp2[1:50], label='MLP')
plt.plot(y_test_cnn2[1:50], label='CNN')
plt.plot(y_test_rnn2[1:50], label='RNN')

plt.legend()
plt.title("32 Features; Neural Network Predictions vs Real Data");

$\mathfrak {\color{#228B22} {7. \ Evaluation \ Metrics \ and \ Predictions }}$

  • explained variance regression score
  • coefficient of determination
  • mean squared error
  • mean absolute error
  • median absolute error
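
A minimal sketch of how the five metrics listed above can be computed with `sklearn.metrics`. The helper name `regression_report` and the toy arrays are assumptions made for the illustration; this is not the `scores` / `scores2` code used in the notebook:

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, r2_score,
                             mean_squared_error, mean_absolute_error,
                             median_absolute_error)

def regression_report(y_true, y_pred):
    """Return the five regression metrics as a dictionary (hypothetical helper)."""
    return {'EV': explained_variance_score(y_true, y_pred),
            'R2': r2_score(y_true, y_pred),
            'MSE': mean_squared_error(y_true, y_pred),
            'MAE': mean_absolute_error(y_true, y_pred),
            'MdAE': median_absolute_error(y_true, y_pred)}

# Toy values, not taken from the housing data
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

report = regression_report(y_true, y_pred)
for name, value in report.items():
    print(name, 'score:', value)
```

The explained variance score ignores a constant bias in the predictions, so it is never lower than R2, and the two coincide when the mean residual is zero.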
In [265]:
hide_code
feature_scale = RobustScaler()
s_features_train = feature_scale.fit_transform(features_train)
s_features_test = feature_scale.transform(features_test)

target_scale = RobustScaler()
s_target_train = target_scale.fit_transform(target_train)
# The same scaling steps for the data set with 32 features
feature_scale2 = RobustScaler()
s_features_train2 = feature_scale2.fit_transform(features_train2)
s_features_test2 = feature_scale2.transform(features_test2)

target_scale2 = RobustScaler()
s_target_train2 = target_scale2.fit_transform(target_train2)
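
A minimal sketch of the scaling round trip used for the targets: `RobustScaler` centers on the median and divides by the interquartile range, and `inverse_transform` maps scaled predictions back to prices. The price values are hypothetical, and the `reshape(-1, 1)` step assumes a one-dimensional target, because the scaler expects a 2-D array:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical prices with one outlier (not taken from the housing data)
prices = np.array([1.0e6, 2.0e6, 3.0e6, 4.0e6, 50.0e6])

target_scaler = RobustScaler()

# Fit on a column vector: subtract the median, divide by the interquartile range
scaled = target_scaler.fit_transform(prices.reshape(-1, 1))

# Map scaled values (for example, network predictions) back to the original units
restored = target_scaler.inverse_transform(scaled).ravel()

print(scaled.ravel())
print(restored)
```

Because the median and the interquartile range are barely affected by the outlier, the bulk of the scaled values stays close to zero, which is convenient for neural network training.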

7.1 Regressors; Scikit-Learn

48 Features

In [266]:
hide_code
gbr = GradientBoostingRegressor(max_depth=4, n_estimators=240)
gbr.fit(s_features_train, target_train)

target_train_predict_gbr = gbr.predict(s_features_train)
target_test_predict_gbr = gbr.predict(s_features_test)

scores2('Gradient Boosting Regressor', target_train, target_train_predict_gbr)
_______________________________________
Gradient Boosting Regressor
_______________________________________
EV score: 0.835396125665
---------
R2 score: 0.835396125665
---------
MSE score: 3.95738887867e+12
---------
MAE score: 1219859.03004
---------
MdAE score: 660793.732442
In [267]:
hide_code
br = BaggingRegressor(n_estimators=384)
br.fit(s_features_train, target_train)

target_train_predict_br = br.predict(s_features_train)
target_test_predict_br = br.predict(s_features_test)

scores2('Bagging Regressor', target_train, target_train_predict_br)
_______________________________________
Bagging Regressor
_______________________________________
EV score: 0.958941592523
---------
R2 score: 0.958920248592
---------
MSE score: 987635023893.0
---------
MAE score: 514105.818252
---------
MdAE score: 221116.169271
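
The scores in this section are computed on the same rows the regressors were fitted on, so they can be optimistic, especially for bagged trees. A minimal sketch, on synthetic data and with hypothetical parameter values, of comparing the training R2 with a cross-validated R2:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing features (hypothetical data)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

model = BaggingRegressor(n_estimators=20, random_state=0)

# R2 on the rows the model was fitted on
train_r2 = r2_score(y, model.fit(X, y).predict(X))

# R2 averaged over 5 held-out folds
cv_r2 = cross_val_score(model, X, y, cv=5, scoring='r2').mean()

print('train R2: %.3f, cross-validated R2: %.3f' % (train_r2, cv_r2))
```

The gap between the two numbers gives a rough idea of how much of the training score comes from memorizing the training rows.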
In [268]:
hide_code
target_train_predict_mlpr = mlpr.predict(s_features_train)
target_test_predict_mlpr = mlpr.predict(s_features_test)

scores2('MLP Regressor', target_train, target_train_predict_mlpr)
_______________________________________
MLP Regressor
_______________________________________
EV score: 0.661052576314
---------
R2 score: 0.660919525383
---------
MSE score: 8.15213678684e+12
---------
MAE score: 1617871.14194
---------
MdAE score: 896876.338598
In [286]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_train[1:50], color = 'black', label='Real Data')

plt.plot(target_train_predict_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_train_predict_br[1:50], label='Bagging Regressor')
plt.plot(target_train_predict_mlpr[1:50], label='MLP Regressor')

plt.legend()
plt.title("48 Features; Regressor Train Predictions vs Real Data");
In [269]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_test_predict_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_test_predict_br[1:50], label='Bagging Regressor')
plt.plot(target_test_predict_mlpr[1:50], label='MLP Regressor')

plt.legend()
plt.title("48 Features; Regressor Test Predictions");

32 Features

In [270]:
hide_code
gbr2 = GradientBoostingRegressor(max_depth=4, n_estimators=288)
gbr2.fit(s_features_train2, target_train2)

target_train_predict_gbr2 = gbr2.predict(s_features_train2)
target_test_predict_gbr2 = gbr2.predict(s_features_test2)

scores2('Gradient Boosting Regressor', target_train2, target_train_predict_gbr2)
_______________________________________
Gradient Boosting Regressor
_______________________________________
EV score: 0.834612329764
---------
R2 score: 0.834612329764
---------
MSE score: 3.97623281654e+12
---------
MAE score: 1237975.70044
---------
MdAE score: 665922.657359
In [271]:
hide_code
br2 = BaggingRegressor(n_estimators=128)
br2.fit(s_features_train2, target_train2)

target_train_predict_br2 = br2.predict(s_features_train2)
target_test_predict_br2 = br2.predict(s_features_test2)

scores2('Bagging Regressor', target_train2, target_train_predict_br2)
_______________________________________
Bagging Regressor
_______________________________________
EV score: 0.949770566288
---------
R2 score: 0.949742469935
---------
MSE score: 1.2082862044e+12
---------
MAE score: 609454.073131
---------
MdAE score: 304172.067708
In [272]:
hide_code
target_train_predict_mlpr2 = mlpr2.predict(s_features_train2)
target_test_predict_mlpr2 = mlpr2.predict(s_features_test2)

scores2('MLP Regressor', target_train2, target_train_predict_mlpr2)
_______________________________________
MLP Regressor
_______________________________________
EV score: 0.691683161903
---------
R2 score: 0.691679591662
---------
MSE score: 7.41260653767e+12
---------
MAE score: 1551042.92678
---------
MdAE score: 819868.405609
In [287]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_train2[1:50], color = 'black', label='Real Data')

plt.plot(target_train_predict_gbr2[1:50], label='Gradient Boosting Regressor')
plt.plot(target_train_predict_br2[1:50], label='Bagging Regressor')
plt.plot(target_train_predict_mlpr2[1:50], label='MLP Regressor')

plt.legend()
plt.title("32 Features; Regressor Train Predictions vs Real Data");
In [273]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_test_predict_gbr2[1:50], label='Gradient Boosting Regressor')
plt.plot(target_test_predict_br2[1:50], label='Bagging Regressor')
plt.plot(target_test_predict_mlpr2[1:50], label='MLP Regressor')

plt.legend()
plt.title("32 Features; Regressor Test Predictions");

7.2 Neural Networks; Keras

48 Features

In [328]:
hide_code
s_target_train_predict_mlp = mlp_model.predict(s_features_train)
s_target_test_predict_mlp = mlp_model.predict(s_features_test)

scores2('MLP #1', s_target_train, s_target_train_predict_mlp)
_______________________________________
MLP #1
_______________________________________
EV score: 0.659932976857
---------
R2 score: 0.65991589204
---------
MSE score: 0.627694086816
---------
MAE score: 0.409889233662
---------
MdAE score: 0.204511690657
In [329]:
hide_code
s_target_train_predict_cnn = cnn_model.predict(s_features_train.reshape(20899, 48, 1))
s_target_test_predict_cnn = cnn_model.predict(s_features_test.reshape(7662, 48, 1))

scores2('CNN #1', s_target_train, s_target_train_predict_cnn)
_______________________________________
CNN #1
_______________________________________
EV score: 0.709791724567
---------
R2 score: 0.706366979916
---------
MSE score: 0.541959197993
---------
MAE score: 0.442228825269
---------
MdAE score: 0.274670204096
In [330]:
hide_code
s_target_train_predict_rnn = rnn_model.predict(s_features_train.reshape(20899, 1, 48))
s_target_test_predict_rnn = rnn_model.predict(s_features_test.reshape(7662, 1, 48))

scores2('RNN #1', s_target_train, s_target_train_predict_rnn)
_______________________________________
RNN #1
_______________________________________
EV score: 0.689047916827
---------
R2 score: 0.688915940005
---------
MSE score: 0.57416862591
---------
MAE score: 0.416860225458
---------
MdAE score: 0.2055714523
In [331]:
hide_code
target_train_predict_mlp = target_scale.inverse_transform(s_target_train_predict_mlp)
target_test_predict_mlp = target_scale.inverse_transform(s_target_test_predict_mlp)

target_train_predict_cnn = target_scale.inverse_transform(s_target_train_predict_cnn)
target_test_predict_cnn = target_scale.inverse_transform(s_target_test_predict_cnn)

target_train_predict_rnn = target_scale.inverse_transform(s_target_train_predict_rnn)
target_test_predict_rnn = target_scale.inverse_transform(s_target_test_predict_rnn)
In [332]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_train[1:50], color = 'black', label='Real Data')

plt.plot(target_train_predict_mlp[1:50], label='MLP')
plt.plot(target_train_predict_cnn[1:50], label='CNN')
plt.plot(target_train_predict_rnn[1:50], label='RNN')

plt.legend()
plt.title("48 Features; Neural Network Train Predictions vs Real Data");
In [333]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_test_predict_mlp[1:50], label='MLP')
plt.plot(target_test_predict_cnn[1:50], label='CNN')
plt.plot(target_test_predict_rnn[1:50], label='RNN')

plt.legend()
plt.title("48 Features; Neural Network Test Predictions");

32 Features

In [334]:
hide_code

s_target_train_predict_mlp2 = mlp_model2.predict(s_features_train2)
s_target_test_predict_mlp2 = mlp_model2.predict(s_features_test2)

scores2('MLP #2', s_target_train2, s_target_train_predict_mlp2)
_______________________________________
MLP #2
_______________________________________
EV score: 0.650691445827
---------
R2 score: 0.638044350772
---------
MSE score: 0.668062445119
---------
MAE score: 0.445446480158
---------
MdAE score: 0.223421678064
In [335]:
hide_code

s_target_train_predict_cnn2 = cnn_model2.predict(s_features_train2.reshape(20899, 32, 1))
s_target_test_predict_cnn2 = cnn_model2.predict(s_features_test2.reshape(7662, 32, 1))

scores2('CNN #2', s_target_train2, s_target_train_predict_cnn2)
_______________________________________
CNN #2
_______________________________________
EV score: 0.684566911131
---------
R2 score: 0.683761274822
---------
MSE score: 0.583682604304
---------
MAE score: 0.421296994369
---------
MdAE score: 0.215244899004
In [336]:
hide_code

s_target_train_predict_rnn2 = rnn_model2.predict(s_features_train2.reshape(20899, 1, 32))
s_target_test_predict_rnn2 = rnn_model2.predict(s_features_test2.reshape(7662, 1, 32))

scores2('RNN #2', s_target_train2, s_target_train_predict_rnn2)
_______________________________________
RNN #2
_______________________________________
EV score: 0.684508238477
---------
R2 score: 0.681688799995
---------
MSE score: 0.587507776263
---------
MAE score: 0.445205277085
---------
MdAE score: 0.249079478908
In [337]:
hide_code
target_train_predict_mlp2 = target_scale2.inverse_transform(s_target_train_predict_mlp2)
target_test_predict_mlp2 = target_scale2.inverse_transform(s_target_test_predict_mlp2)

target_train_predict_cnn2 = target_scale2.inverse_transform(s_target_train_predict_cnn2)
target_test_predict_cnn2 = target_scale2.inverse_transform(s_target_test_predict_cnn2)

target_train_predict_rnn2 = target_scale2.inverse_transform(s_target_train_predict_rnn2)
target_test_predict_rnn2 = target_scale2.inverse_transform(s_target_test_predict_rnn2)
In [338]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_train2[1:50], color = 'black', label='Real Data')

plt.plot(target_train_predict_mlp2[1:50], label='MLP')
plt.plot(target_train_predict_cnn2[1:50], label='CNN')
plt.plot(target_train_predict_rnn2[1:50], label='RNN')

plt.legend()
plt.title("32 Features; Neural Network Train Predictions vs Real Data");
In [339]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_test_predict_mlp2[1:50], label='MLP')
plt.plot(target_test_predict_cnn2[1:50], label='CNN')
plt.plot(target_test_predict_rnn2[1:50], label='RNN')

plt.legend()
plt.title("32 Features; Neural Network Test Predictions");

Display All Predictions

In [340]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_train[1:50], color = 'black', label='Real Data')

plt.plot(target_train_predict_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_train_predict_br[1:50], label='Bagging Regressor')
plt.plot(target_train_predict_mlpr[1:50], label='MLP Regressor')

plt.plot(target_train_predict_mlp[1:50], label='MLP')
plt.plot(target_train_predict_cnn[1:50], label='CNN')
plt.plot(target_train_predict_rnn[1:50], label='RNN')

plt.legend()
plt.title("48 Features; Train Predictions vs Real Data");
In [341]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_test_predict_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_test_predict_br[1:50], label='Bagging Regressor')
plt.plot(target_test_predict_mlpr[1:50], label='MLP Regressor')

plt.plot(target_test_predict_mlp[1:50], label='MLP')
plt.plot(target_test_predict_cnn[1:50], label='CNN')
plt.plot(target_test_predict_rnn[1:50], label='RNN')

plt.legend()
plt.title("48 Features; Test Predictions");
In [342]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_train2[1:50], color = 'black', label='Real Data')

plt.plot(target_train_predict_gbr2[1:50], label='Gradient Boosting Regressor')
plt.plot(target_train_predict_br2[1:50], label='Bagging Regressor')
plt.plot(target_train_predict_mlpr2[1:50], label='MLP Regressor')

plt.plot(target_train_predict_mlp2[1:50], label='MLP')
plt.plot(target_train_predict_cnn2[1:50], label='CNN')
plt.plot(target_train_predict_rnn2[1:50], label='RNN')

plt.legend()
plt.title("32 Features; Train Predictions vs Real Data");
In [343]:
hide_code
plt.figure(figsize = (18, 6))

plt.plot(target_test_predict_gbr2[1:50], label='Gradient Boosting Regressor')
plt.plot(target_test_predict_br2[1:50], label='Bagging Regressor')
plt.plot(target_test_predict_mlpr2[1:50], label='MLP Regressor')

plt.plot(target_test_predict_mlp2[1:50], label='MLP')
plt.plot(target_test_predict_cnn2[1:50], label='CNN')
plt.plot(target_test_predict_rnn2[1:50], label='RNN')

plt.legend()
plt.title("32 Features; Test Predictions");

$\mathfrak {\color{#228B22} {8. \ Project \ Design }}$

The project is based on a competition hosted at https://www.kaggle.com. The competition version of this notebook is available here: https://www.kaggle.com/olgabelitskaya/sberbank-russian-housing-market .

Several popular libraries (numpy, pandas, matplotlib, scikit-learn and keras) were used to build the regression models.

The most valuable part of this project is the study of real data and the attempt to bring the quality of the predictions on it close to the threshold of 70-80 percent.